Overview
Dataset: tatsu-lab/alpaca
Conversations: 52002
Analyzed: 52002
Coverage: 100.0%
Messages: 156006
Analyzers: content_pattern, length, quality, diversity, training_quality
Recommendations
27 issues
high
Truncated responses detected
Found 7958 assistant responses (15.3%) that appear to be truncated (ending mid-sentence or with incomplete punctuation). Training on truncated responses may cause the model to generate incomplete outputs. Consider completing or removing these samples.
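The report does not show the rule the analyzer uses to decide that a response is truncated. A minimal sketch of one plausible heuristic, assuming "truncated" means the text does not end in terminal punctuation; the names `looks_truncated`, `TERMINAL`, and `CLOSERS` are hypothetical:

```python
# Assumption: a finished response ends with one of these characters,
# optionally followed by closing quotes or brackets.
TERMINAL = (".", "!", "?", "…")
CLOSERS = "\"')]}"

def looks_truncated(text: str) -> bool:
    """Return True if non-empty text does not end like a finished sentence."""
    stripped = text.rstrip().rstrip(CLOSERS)
    if not stripped:
        return False  # empty responses are reported as a separate issue
    return not stripped.endswith(TERMINAL)

print(looks_truncated("The capital of France is"))         # mid-sentence
print(looks_truncated("The capital of France is Paris."))  # complete
```

A rule of this kind also flags short answers, lists, and code that legitimately end without punctuation, so the flagged samples are worth spot-checking before any are removed.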
medium
Outliers detected in diversity vocabulary richness
Found 1612 samples (1.0%) with values outside 3.0 standard deviations from the mean. High outliers: 1612, Low outliers: 0. Consider reviewing these samples for potential data quality issues.
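The 3.0-standard-deviation rule described above can be sketched as follows. This is an illustration under the assumption that the mean and standard deviation are computed over all samples; `sigma_outliers` is a hypothetical name, not the analyzer's API:

```python
from statistics import mean, pstdev

def sigma_outliers(values, k=3.0):
    """Return (high, low) index lists for values more than k std from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [], []  # constant data has no outliers under this rule
    high = [i for i, v in enumerate(values) if v > mu + k * sigma]
    low = [i for i, v in enumerate(values) if v < mu - k * sigma]
    return high, low

# Toy data: 99 ordinary values and one extreme value.
high, low = sigma_outliers([1.0] * 99 + [100.0])
print(high, low)
```

If the check uses the vocabulary-richness figures from the Message Statistics table (mean 3.87, std 1.29, min 0.0), the lower cutoff is 3.87 - 3 × 1.29 = 0.0, which equals the metric's minimum. That is consistent with the report finding high outliers only.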
medium
Inconsistent instruction formatting detected
Found multiple instruction format patterns in the dataset: alpaca: 20679, vicuna: 34. Mixing formats may confuse the model and reduce training effectiveness. Consider standardizing to a single format.
low
Outliers detected in content pattern placeholder count
Found 286 samples (0.2%) with values outside 3.0 standard deviations from the mean. High outliers: 286, Low outliers: 0. Consider reviewing these samples for potential data quality issues.
low
Multimodal distribution in length char count
Detected 5 distinct modes in the distribution (confidence: 60%). Mode 1: 98346 samples (61.6%), mean=76.51, std=31.09; Mode 3: 31070 samples (20.7%), mean=158.35, std=13.39; Mode 2: 19896 samples (12.6%), mean=371.49, std=100.23; Mode 5: 5670 samples (4.1%), mean=697.54, std=107.36; Mode 4: 1024 samples (1.0%), mean=1338.58, std=382.77. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
low
Outliers detected in length char count
Found 887 samples (0.6%) that are outliers within their respective modes. Distribution has 5 modes (Mode 1: μ=76.51, σ=31.09; Mode 3: μ=158.35, σ=13.39; Mode 2: μ=371.49, σ=100.23; Mode 5: μ=697.54, σ=107.36; Mode 4: μ=1338.58, σ=382.77). Outliers are samples more than 3.0 standard deviations from their mode's mean.
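Per-mode outlier detection can be sketched with the mode statistics reported above. How the analyzer assigns a sample to a mode is not stated; this sketch assumes assignment to the mode with the smallest absolute z-score, and `nearest_mode` and `is_mode_outlier` are hypothetical names:

```python
# (mean, std) pairs copied from the char-count modes reported above.
MODES = [
    (76.51, 31.09),
    (158.35, 13.39),
    (371.49, 100.23),
    (697.54, 107.36),
    (1338.58, 382.77),
]

def nearest_mode(value, modes):
    """Assumption: pick the mode whose mean is closest in units of its own std."""
    return min(range(len(modes)),
               key=lambda i: abs(value - modes[i][0]) / modes[i][1])

def is_mode_outlier(value, modes, k=3.0):
    """Flag a value that lies more than k std from its assigned mode's mean."""
    mu, sigma = modes[nearest_mode(value, modes)]
    return abs(value - mu) > k * sigma

print(is_mode_outlier(4181.0, MODES))  # the maximum char count in the report
print(is_mode_outlier(80.0, MODES))    # a typical short message
```

Scoring each sample against its own mode keeps long but ordinary messages from being flagged merely because most messages are short.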
low
Multimodal distribution in length word count
Detected 4 distinct modes in the distribution (confidence: 43%). Mode 1: 44369 samples (31.3%), mean=7.83, std=3.03; Mode 4: 84998 samples (50.1%), mean=19.32, std=4.68; Mode 2: 24750 samples (16.6%), mean=69.47, std=24.14; Mode 3: 1889 samples (2.0%), mean=187.98, std=59.21. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
low
Outliers detected in length word count
Found 478 samples (0.3%) that are outliers within their respective modes. Distribution has 4 modes (Mode 1: μ=7.83, σ=3.03; Mode 4: μ=19.32, σ=4.68; Mode 2: μ=69.47, σ=24.14; Mode 3: μ=187.98, σ=59.21). Outliers are samples more than 3.0 standard deviations from their mode's mean.
low
Multimodal distribution in length token count
Detected 5 distinct modes in the distribution (confidence: 57%). Mode 1: 92322 samples (55.9%), mean=14.33, std=5.04; Mode 3: 34284 samples (24.4%), mean=28.0, std=2.94; Mode 4: 23431 samples (14.5%), mean=70.98, std=21.11; Mode 2: 4657 samples (3.9%), mean=138.86, std=19.54; Mode 5: 1312 samples (1.2%), mean=257.67, std=77.4. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
low
Outliers detected in length token count
Found 1135 samples (0.7%) that are outliers within their respective modes. Distribution has 5 modes (Mode 1: μ=14.33, σ=5.04; Mode 3: μ=28.0, σ=2.94; Mode 4: μ=70.98, σ=21.11; Mode 2: μ=138.86, σ=19.54; Mode 5: μ=257.67, σ=77.4). Outliers are samples more than 3.0 standard deviations from their mode's mean.
low
Outliers detected in quality pii count
Found 183 samples (0.1%) with values outside 3.0 standard deviations from the mean. High outliers: 183, Low outliers: 0. Consider reviewing these samples for potential data quality issues.
low
Outliers detected in diversity unique words ratio
Found 1211 samples (0.8%) with values outside 3.0 standard deviations from the mean. High outliers: 0, Low outliers: 1211. Consider reviewing these samples for potential data quality issues.
low
Outliers detected in diversity type token ratio
Found 1211 samples (0.8%) with values outside 3.0 standard deviations from the mean. High outliers: 0, Low outliers: 1211. Consider reviewing these samples for potential data quality issues.
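The two diversity metrics above report identical outlier counts, which suggests they measure the same quantity. A minimal sketch of a type-token ratio, assuming lowercase whitespace tokenization (the analyzer's tokenizer is not shown, and `type_token_ratio` is a hypothetical name):

```python
def type_token_ratio(text: str) -> float:
    """Ratio of distinct words to total words; 0.0 for empty text."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat and the dog"))  # 4 distinct words out of 5
print(type_token_ratio("a a a a"))              # 1 distinct word out of 4
```

Low outliers on this metric are messages that repeat the same words. The ratio also tends to fall as text gets longer, so long messages are not necessarily low quality when they score below the mean.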
low
Multimodal distribution in training quality instruction word count
Detected 5 distinct modes in the distribution (confidence: 46%). Mode 1: 30488 samples (53.0%), mean=9.29, std=2.35; Mode 3: 13901 samples (31.0%), mean=16.87, std=2.24; Mode 4: 6029 samples (12.4%), mean=26.79, std=4.13; Mode 2: 1390 samples (3.0%), mean=50.38, std=10.58; Mode 5: 194 samples (0.6%), mean=104.11, std=39.85. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
low
Outliers detected in training quality instruction word count
Found 5 samples (0.0%) that are outliers within their respective modes. Distribution has 5 modes (Mode 1: μ=9.29, σ=2.35; Mode 3: μ=16.87, σ=2.24; Mode 4: μ=26.79, σ=4.13; Mode 2: μ=50.38, σ=10.58; Mode 5: μ=104.11, σ=39.85). Outliers are samples more than 3.0 standard deviations from their mode's mean.
low
Multimodal distribution in training quality response word count
Detected 5 distinct modes in the distribution (confidence: 60%). Mode 2: 21640 samples (39.5%), mean=8.22, std=5.1; Mode 4: 14893 samples (30.0%), mean=39.77, std=12.61; Mode 1: 11777 samples (21.1%), mean=81.28, std=13.12; Mode 5: 2809 samples (6.8%), mean=130.69, std=16.7; Mode 3: 883 samples (2.6%), mean=229.97, std=62.52. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
low
Outliers detected in training quality response word count
Found 20 samples (0.0%) that are outliers within their respective modes. Distribution has 5 modes (Mode 2: μ=8.22, σ=5.1; Mode 4: μ=39.77, σ=12.61; Mode 1: μ=81.28, σ=13.12; Mode 5: μ=130.69, σ=16.7; Mode 3: μ=229.97, σ=62.52). Outliers are samples more than 3.0 standard deviations from their mode's mean.
low
Empty or near-empty messages detected
Found 1188 messages (0.8%) with 5 or fewer characters. These may indicate data quality issues or placeholder content that should be reviewed.
low
Many short messages detected
Found 29201 messages (18.7%) with fewer than 10 words. This may be intentional (e.g., short responses) or indicate low-quality samples worth reviewing.
low
Low quality samples detected
Found 1 message (0.0%) with a quality score below 0.5. Average quality score: 1.00. Consider filtering or reviewing low-quality samples before training.
low
Encoding issues detected
Found 2 messages (0.0%) with potential encoding issues (e.g., mojibake, invalid characters). These may indicate data corruption or incorrect character encoding. Consider re-encoding or cleaning affected samples.
low
Highly repetitive content detected
Found 135 messages (0.1%) with high repetition ratios. Repetitive content may cause the model to learn repetitive patterns. Consider reviewing or filtering these samples.
low
Incomplete responses detected
Found 3880 assistant responses (7.5%) with low completeness scores (below 0.5). Average completeness score: 0.90. Incomplete responses may teach the model to generate truncated or minimal outputs. Consider expanding short responses or removing low-quality samples.
low
Placeholder text detected
Found 286 messages (0.2%) containing placeholder text (e.g., [Name], [Company Name], [Your...], [Insert...]). These indicate incomplete or template responses that should be filled in or removed before training.
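Placeholder text of the kind listed above can be counted with a regular expression. This is a sketch, assuming a placeholder is a short bracketed phrase that starts with a capital letter; the pattern and the name `placeholder_count` are illustrative, not the analyzer's actual rule:

```python
import re

# Assumption: "[Name]", "[Company Name]", "[Your...]" style spans.
# Requires a capital first letter and at least two characters inside
# the brackets, so "[1]" and "arr[i]" are not counted.
PLACEHOLDER = re.compile(r"\[[A-Z][A-Za-z .'/-]{1,40}\]")

def placeholder_count(text: str) -> int:
    """Count bracketed template placeholders in a message."""
    return len(PLACEHOLDER.findall(text))

print(placeholder_count("Dear [Name], welcome to [Company Name]."))
print(placeholder_count("See items [1] and [2]."))
```

A pattern this loose also matches the link text of markdown links such as `[Click here](...)`, so flagged messages should be reviewed rather than dropped automatically.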
low
AI hallucinated experiences detected
Found 26 messages (0.0%) containing fabricated first-person experiences (e.g., 'When I was working as a project manager...'). Training on these may cause the model to generate similar hallucinations. Consider removing or rewriting these samples.
low
Nooutput/NA markers detected
Found 38 messages (0.0%) containing nooutput markers (e.g., <nooutput>, N/A, None). These are unusable training samples and should be removed.
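Removing these samples can be sketched as a simple filter. The sketch assumes each record is a dict whose response is stored under an `output` key and that a marker makes up the whole response; `NOOUTPUT_MARKERS`, `is_nooutput`, and `drop_nooutput` are hypothetical names:

```python
# Markers taken from the examples in the finding above, lowercased.
NOOUTPUT_MARKERS = {"<nooutput>", "n/a", "none"}

def is_nooutput(response: str) -> bool:
    """True if the whole response is just a no-output marker."""
    return response.strip().lower() in NOOUTPUT_MARKERS

def drop_nooutput(records):
    """Keep only records whose 'output' field is a real response."""
    return [r for r in records if not is_nooutput(r["output"])]

records = [
    {"output": "<nooutput>"},
    {"output": " N/A "},
    {"output": "None of the options apply."},
    {"output": "Paris"},
]
print(len(drop_nooutput(records)))
```

Matching the whole response is stricter than the "containing" wording in the finding. It avoids discarding legitimate answers that merely include a word such as "None".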
low
AI refusal patterns detected
Found 10 messages (0.0%) containing AI refusal patterns (e.g., 'I cannot provide...', 'I'm unable to help...'). These indicate the model refused to complete the task. Consider reviewing or removing these samples unless training for appropriate refusals.
Distributions
10 charts
Content Pattern Placeholder Count
Content Pattern Suspicious Url Count
Content Pattern Content Pattern Score
Length Char Count (5 modes)
multimodal
5 distinct modes detected
| Mode | Share | Mean | Std | Count |
|---|---|---|---|---|
| Mode 1 | 61.6% | 76.5 | 31.1 | 98346 |
| Mode 3 | 20.7% | 158.3 | 13.4 | 31070 |
| Mode 2 | 12.6% | 371.5 | 100.2 | 19896 |
| Mode 5 | 4.1% | 697.5 | 107.4 | 5670 |
| Mode 4 | 1.0% | 1338.6 | 382.8 | 1024 |
Length Word Count (4 modes)
multimodal
4 distinct modes detected
| Mode | Share | Mean | Std | Count |
|---|---|---|---|---|
| Mode 1 | 31.3% | 7.8 | 3.0 | 44369 |
| Mode 4 | 50.1% | 19.3 | 4.7 | 84998 |
| Mode 2 | 16.6% | 69.5 | 24.1 | 24750 |
| Mode 3 | 2.0% | 188.0 | 59.2 | 1889 |
Length Token Count (5 modes)
multimodal
5 distinct modes detected
| Mode | Share | Mean | Std | Count |
|---|---|---|---|---|
| Mode 1 | 55.9% | 14.3 | 5.0 | 92322 |
| Mode 3 | 24.4% | 28.0 | 2.9 | 34284 |
| Mode 4 | 14.5% | 71.0 | 21.1 | 23431 |
| Mode 2 | 3.9% | 138.9 | 19.5 | 4657 |
| Mode 5 | 1.2% | 257.7 | 77.4 | 1312 |
Quality Pii Count
Quality Repetition Ratio
Quality Quality Score
Diversity Unique Words Ratio
Anomaly Detection
5 visualizations
Outliers in Content Pattern Placeholder Count: 286 outliers
Outliers in Content Pattern Suspicious Url Count: 1 outlier
Outliers in Content Pattern Content Pattern Score: 358 outliers
Outliers in Length Char Count: 3195 outliers
Outliers in Length Word Count: 3119 outliers
Message Statistics
| Metric | Distribution | Mean | Std | Min | Max | Median |
|---|---|---|---|---|---|---|
| text_content_placeholder_count | unimodal | 0.0 | 0.08 | 0.0 | 8.0 | 0.0 |
| text_content_url_count | unimodal | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| text_content_pattern_score | unimodal | 1.0 | 0.01 | 0.1 | 1.0 | 1.0 |
| text_content_char_count | multimodal (5) | 161.28 | 181.72 | 0.0 | 4181.0 | 105.0 |
| └ Mode 1 (61.6%) | 98346 samples | 76.51 | 31.09 | - | - | - |
| └ Mode 3 (20.7%) | 31070 samples | 158.35 | 13.39 | - | - | - |
| └ Mode 2 (12.6%) | 19896 samples | 371.49 | 100.23 | - | - | - |
| └ Mode 5 (4.1%) | 5670 samples | 697.54 | 107.36 | - | - | - |
| └ Mode 4 (1.0%) | 1024 samples | 1338.58 | 382.77 | - | - | - |
| text_content_word_count | multimodal (4) | 26.05 | 29.75 | 0.0 | 717.0 | 16.0 |
| └ Mode 1 (31.3%) | 44369 samples | 7.83 | 3.03 | - | - | - |
| └ Mode 4 (50.1%) | 84998 samples | 19.32 | 4.68 | - | - | - |
| └ Mode 2 (16.6%) | 24750 samples | 69.47 | 24.14 | - | - | - |
| └ Mode 3 (2.0%) | 1889 samples | 187.98 | 59.21 | - | - | - |
| text_content_token_count | multimodal (5) | 31.61 | 36.48 | 0.0 | 958.0 | 18.0 |
| └ Mode 1 (55.9%) | 92322 samples | 14.33 | 5.04 | - | - | - |
| └ Mode 3 (24.4%) | 34284 samples | 28.0 | 2.94 | - | - | - |
| └ Mode 4 (14.5%) | 23431 samples | 70.98 | 21.11 | - | - | - |
| └ Mode 2 (3.9%) | 4657 samples | 138.86 | 19.54 | - | - | - |
| └ Mode 5 (1.2%) | 1312 samples | 257.67 | 77.4 | - | - | - |
| text_content_pii_count | unimodal | 0.0 | 0.11 | 0.0 | 26.0 | 0.0 |
| text_content_repetition_ratio | unimodal | 0.0 | 0.02 | 0.0 | 0.83 | 0.0 |
| text_content_quality_score | unimodal | 1.0 | 0.01 | 0.46 | 1.0 | 1.0 |
| text_content_words_ratio | unimodal | 0.88 | 0.11 | 0.0 | 1.0 | 0.88 |
| text_content_token_ratio | unimodal | 0.88 | 0.11 | 0.0 | 1.0 | 0.88 |
| text_content_vocabulary_richness | unimodal | 3.87 | 1.29 | 0.0 | 14.95 | 3.5 |
| text_content_clarity_score | unimodal | 0.96 | 0.08 | 0.5 | 1.0 | 1.0 |
| text_content_quality_instruction_word_count | multimodal (5) | 14.79 | 10.71 | 4.0 | 414.0 | 12.0 |
| └ Mode 1 (53.0%) | 30488 samples | 9.29 | 2.35 | - | - | - |
| └ Mode 3 (31.0%) | 13901 samples | 16.87 | 2.24 | - | - | - |
| └ Mode 4 (12.4%) | 6029 samples | 26.79 | 4.13 | - | - | - |
| └ Mode 2 (3.0%) | 1390 samples | 50.38 | 10.58 | - | - | - |
| └ Mode 5 (0.6%) | 194 samples | 104.11 | 39.85 | - | - | - |
| text_content_quality_response_word_count | multimodal (5) | 44.18 | 44.97 | 0.0 | 717.0 | 30.0 |
| └ Mode 2 (39.5%) | 21640 samples | 8.22 | 5.1 | - | - | - |
| └ Mode 4 (30.0%) | 14893 samples | 39.77 | 12.61 | - | - | - |
| └ Mode 1 (21.1%) | 11777 samples | 81.28 | 13.12 | - | - | - |
| └ Mode 5 (6.8%) | 2809 samples | 130.69 | 16.7 | - | - | - |
| └ Mode 3 (2.6%) | 883 samples | 229.97 | 62.52 | - | - | - |
| text_content_completeness_score | unimodal | 0.9 | 0.25 | 0.0 | 1.0 | 1.0 |
| text_content_quality_score | unimodal | 1.0 | 0.0 | 0.8 | 1.0 | 1.0 |
Conversation Turns
| Statistic | Value |
|---|---|
| Count | 52002 |
| Mean | 3.0 |
| Std | 0.0 |
| Min | 3 |
| Max | 3 |
| Median | 3.0 |